AITopics | long-form generation

Collaborating Authors

long-form generation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

Zhao, Xingtao, Peng, Hao, Su, Dingli, Zeng, Xianghua, Liu, Chunyang, Liao, Jinzhi, Yu, Philip S.

arXiv.org Artificial IntelligenceDec-5-2025

Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding ``hallucinating'' falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. SeSE operates in a zero-resource manner and is applicable to both open- and closed-source LLMs, making it an ``off-the-shelf" solution for new models and tasks. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation, we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.16275

Country: North America > United States (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry: Information Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

Chen, Tong, Asai, Akari, Zettlemoyer, Luke, Hajishirzi, Hannaneh, Brahman, Faeze

arXiv.org Artificial IntelligenceOct-21-2025

Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.

large language model, machine learning, reinforcement learning, (22 more...)

arXiv.org Artificial Intelligence

2510.17733

Country:

North America > United States (1.00)
Asia (0.68)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)

Add feedback

UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation

Yang, Ruihan, Zhang, Caiqi, Zhang, Zhisong, Huang, Xinting, Yu, Dong, Collier, Nigel, Yang, Deqing

arXiv.org Artificial IntelligenceOct-10-2025

Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE covers five domains and includes more than 1,000 entities, each with paired short- and long-form QA items. Our dataset is the first to directly link short- and long-form QA through aligned questions and gold-standard answers. Along with UNCLE, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. We then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models' performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.

express uncertainty, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2505.16922

Country:

North America > United States (0.68)
Europe > United Kingdom > England (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Media > Film (0.69)
Leisure & Entertainment (0.69)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.47)
Health & Medicine > Therapeutic Area > Immunology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Add feedback

Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?

Xu, Xuan, Li, Haolun, Yang, Zhongliang, Chu, Beilin, Song, Jia, Xu, Moxuan, Zhou, Linna

arXiv.org Artificial IntelligenceOct-6-2025

Traditional topic models such as neural topic models rely on inference and generation networks to learn latent topic distributions. This paper explores a new paradigm for topic modeling in the era of large language models, framing TM as a long-form generation task whose definition is updated in this paradigm. We propose a simple but practical approach to implement LLM-based topic model tasks out of the box (sample a data subset, generate topics and representative text with our prompt, text assignment with keyword match). We then investigate whether the long-form generation paradigm can beat NTMs via zero-shot prompting. We conduct a systematic comparison between NTMs and LLMs in terms of topic quality and empirically examine the claim that "a majority of NTMs are outdated."

computational linguistic, large language model, natural language, (13 more...)

arXiv.org Artificial Intelligence

2510.03174

Country:

North America > United States (0.28)
North America > Mexico (0.28)
Asia > Middle East (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs

Yaldiz, Duygu Nur, Bakman, Yavuz Faruk, Kang, Sungmin, Öziş, Alperen, Yildiz, Hayrettin Eren, Shah, Mitash Ashish, Huang, Zhiqi, Kumar, Anoop, Samuel, Alfy, Liu, Daben, Karimireddy, Sai Praneeth, Avestimehr, Salman

arXiv.org Artificial IntelligenceJul-14-2025

Generative Large Language Models (LLMs)inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets, TriviaQA, GSM8K, and FactScore-Bio. The code is available at https://github.com/Ybakman/TruthTorchLM

computational linguistic, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2507.08203

Country:

Europe (0.68)
North America > United States > New Jersey > Essex County (0.14)

Genre: Research Report (0.82)

Industry:

Media (1.00)
Leisure & Entertainment > Sports > Soccer (0.48)
Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

Wu, Yuhao, Bai, Yushi, Hu, Zhiqiang, Lee, Roy Ka-Wei, Li, Juanzi

arXiv.org Artificial IntelligenceJun-24-2025

Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on ''teaching'', which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.18841

Country:

Asia > Singapore (0.04)
North America > United States (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Reconsidering LLM Uncertainty Estimation Methods in the Wild

Bakman, Yavuz, Yaldiz, Duygu Nur, Kang, Sungmin, Zhang, Tuo, Buyukates, Baturalp, Avestimehr, Salman, Karimireddy, Sai Praneeth

arXiv.org Artificial IntelligenceJun-3-2025

Large Language Model (LLM) Uncertainty Estimation (UE) methods have become a crucial tool for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for handling multiple UE scores for a single query. Our evaluations on 19 UE methods reveal that most of them are highly sensitive to threshold selection when there is a distribution shift in the calibration dataset. While these methods generally exhibit robustness against previous chat history and typos, they are significantly vulnerable to adversarial prompts. Additionally, while existing UE methods can be adapted for long-form generation through various strategies, there remains considerable room for improvement. Lastly, ensembling multiple UE scores at test time provides a notable performance boost, which highlights its potential as a practical improvement strategy. Code is available at: https://github.com/duygunuryldz/uncertainty_in_the_wild.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.01114

Country:

Asia (1.00)
North America > United States > New Jersey (0.28)

Genre: Research Report > New Finding (0.68)

Industry:

Media (1.00)
Leisure & Entertainment > Games > Computer Games (1.00)
Education (0.92)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Add feedback

LoGU: Long-form Generation with Uncertainty Expressions

Yang, Ruihan, Zhang, Caiqi, Zhang, Zhisong, Huang, Xinting, Yang, Sen, Collier, Nigel, Yu, Dong, Yang, Deqing

arXiv.org Artificial IntelligenceOct-24-2024

While Large Language Models (LLMs) demonstrate impressive capabilities, they still struggle with generating factually incorrect content (i.e., hallucinations). A promising approach to mitigate this issue is enabling models to express uncertainty when unsure. Previous research on uncertainty modeling has primarily focused on short-form QA, but realworld applications often require much longer responses. In this work, we introduce the task of Long-form Generation with Uncertainty(LoGU). We identify two key challenges: Uncertainty Suppression, where models hesitate to express uncertainty, and Uncertainty Misalignment, where models convey uncertainty inaccurately. To tackle these challenges, we propose a refinement-based data collection framework and a two-stage training pipeline. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. The collected data are then used in training through supervised fine-tuning (SFT) and direct preference optimization (DPO) to enhance uncertainty expression. Extensive experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2410.14309

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Singapore (0.04)
(7 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Promising Solution (0.48)
Personal > Obituary (0.46)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.93)
Health & Medicine > Health Care Technology > Telehealth (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Atomic Calibration of LLMs in Long-Form Generations

Zhang, Caiqi, Yang, Ruihan, Zhang, Zhisong, Huang, Xinting, Yang, Sen, Yu, Dong, Collier, Nigel

arXiv.org Artificial IntelligenceOct-17-2024

Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs' trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.

calibration, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.13246

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Singapore (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(2 more...)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Linguistic Calibration of Language Models

Band, Neil, Li, Xuechen, Ma, Tengyu, Hashimoto, Tatsunori

arXiv.org Machine LearningMar-30-2024

Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce text with calibrated confidence statements. Through the lens of decision-making, we formalize linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under distribution shift on question-answering and under a significant task shift to person biography generation. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.

calibration, forecaster, long-form generation, (15 more...)

arXiv.org Machine Learning

2404.00474

Country:

South America > Colombia (0.04)
Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)
(12 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine (0.93)
Education (0.67)
Leisure & Entertainment > Sports > Motorsports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.87)

Add feedback